A CNN-transformer fusion network for COVID-19 CXR image classification

The global health crisis caused by the rapid spread of coronavirus disease (Covid-19) has severely affected healthcare, the economy, and many other aspects of society. The highly infectious and insidious nature of the novel coronavirus greatly increases the difficulty of outbreak prevention and control. Early and rapid detection of Covid-19 is an effective way to reduce its spread. However, detecting Covid-19 accurately and quickly in large populations remains a major challenge worldwide. In this study, a CNN-transformer fusion framework is proposed for the automatic classification of pneumonia on chest X-ray images. This framework consists of two parts: data processing and image classification. The data processing stage eliminates differences between data from different medical institutions so that all data share the same storage format; in the image classification stage, we use a multi-branch network with a custom convolution module and a transformer module, comprising feature extraction, feature focus, and feature classification sub-networks. The feature extraction sub-network extracts shallow image features and exchanges information between the convolution and transformer modules. Local and global features are then extracted by the convolution and transformer modules of the feature focus sub-network, and classified by the feature classification sub-network. The proposed network can decide whether or not a patient has pneumonia and differentiate between Covid-19 and bacterial pneumonia. The network was evaluated on the collected benchmark datasets, achieving an accuracy, precision, recall, and F1 score of 97.09%, 97.16%, 96.93%, and 97.04%, respectively. Compared with methods proposed by other researchers, our network achieves better results in terms of accuracy, precision, and F1 score, demonstrating its superiority for Covid-19 detection. With further improvements, we hope that this network will provide doctors with an effective tool for diagnosing Covid-19.


Introduction
Covid-19 is a pulmonary disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) that emerged in 2019. It is highly infectious and prone to mutation, and strains such as delta, omicron, and the omicron XE variant have spread worldwide [1,2]. On 30 January 2020, the World Health Organization (WHO) recognized the outbreak as a public health emergency of international concern (PHEIC) [3] and declared it a pandemic on 11 March 2020 [4]. According to the WHO weekly epidemiological update on COVID-19 (12 April 2022), as of 10 April 2022, over 496 million confirmed cases and over 6 million deaths had been reported globally [5]. Therefore, detecting Covid-19-positive patients in the population at an early stage is not only important to curb virus transmission and mutation, but also crucial for disease staging and for devising treatment plans. Currently, the main method for testing Covid-19 patients worldwide is reverse transcription polymerase chain reaction (RT-PCR) [6]. RT-PCR is the gold standard for detecting viral ribonucleic acid (RNA); however, in some cases its sensitivity appears to be lower than that of computed tomography (CT), 71% vs 98% according to reports. The lower sensitivity can be attributed to inadequate reagent supply, a lack of the expertise required for testing, low viral load in patients, and long testing cycles [7]. Unfortunately, if Covid-19 mutates during transmission, the epidemic is likely to spread rapidly, with large numbers of cases appearing in a short time. As a result, rapidly detecting Covid-19 in a large population poses a great challenge to medical institutions worldwide.
Medical imaging and deep learning (DL) can play an important role in early detection efforts to combat the disease. In recent years, researchers have used deep neural networks to achieve remarkable results in a variety of fields. Recent advances in DL show that computers can extract more information from images, more reliably and more accurately, than ever before [8,9]. However, further developing and optimizing DL techniques for the characteristics of medical images and medical data remains an important but challenging research problem [10]. For example, ground-glass opacities are evident on chest X-ray or CT images of patients with Covid-19 [11,12]. Thus, a chest radiology-based system could be an effective way to detect, quantify, and track Covid-19 cases. Furthermore, nature-inspired and heuristic optimization algorithms have been successfully adopted for various medical imaging applications; for example, the red fox optimization algorithm (RFOA), a heuristic method, has been used for medical image segmentation [13]. Nowadays, DL is increasingly applied to medical image classification, object detection, segmentation, and other tasks, and is replacing traditional machine learning methods in medical imaging [14].
Convolutional neural networks (CNNs) [15-21] have been widely used for image classification tasks, including Covid-19 detection. For example, COVID-Net adopts a lightweight projection-expansion-projection-extension (PEPX) design, which greatly enhances representation capability while reducing computational complexity, achieving better classification results [28]. Khan et al. proposed a CNN-LSTM and improved max-value feature optimization framework to address the issues of multi-source fusion and redundant features [29], and also proposed a deep learning and explainable AI technique to select the best features for the diagnosis and classification of COVID-19 [30]. Arias-Garzón et al. proposed a new approach using existing DL models, which focuses on enhancing the pre-processing stage to obtain accurate and reliable classification results. The pre-processing stage consists of a projection-based filtering network to divide the data into frontal or lateral views, a segmentation model to extract lung regions containing relevant information, and a transfer-learning VGG classification model, achieving an accuracy of 97% [31]. Ahmed et al. studied four classification methods based on X-ray and CT images from three aspects, pre-processing, feature extraction, and classification, and proposed the use of Convid-Net to classify Covid-19 with an accuracy of 97.99% [32]. Islam et al. proposed a detection system for Covid-19 based on the combination of LSTM (Long Short-Term Memory) and CNN, where the CNN was used for deep feature extraction and the LSTM was used to classify the extracted features, with an accuracy of 99.4% [33].
Recently, the application of transformers [34-43] to computer vision tasks has increasingly demonstrated unique advantages. The vision transformer (ViT) combines a self-attention mechanism with a multi-layer perceptron (MLP), which captures complex spatial transformations and long-range feature dependencies. Whereas a CNN attends to local features, the transformer focuses on the global representation of an image. Inspired by the transformer's success in natural language processing (NLP), Dosovitskiy et al. applied a standard transformer directly to images with the fewest possible modifications. To do so, an image is split into patches and the sequence of linear embeddings of these patches is provided as input to a transformer; image patches are treated the same way as tokens (words) in an NLP application [35].
For the detection of Covid-19 caused by SARS-CoV-2, this study proposes a classification network based on CNN and transformer fusion, which automatically classifies the chest radiographs acquired during medical examinations. This network can assist doctors in judging whether a patient has pneumonia and, furthermore, in distinguishing Covid-19 from bacterial pneumonia.
Datasets were collected from different medical institutions to enhance the applicability and robustness of the model. Data differences between medical institutions are reduced in the data processing stage, and a CNN-transformer fusion network is utilized for classification.
The main contributions of this work are as follows.

1. A fusion network of CNN and transformer is presented for COVID-19 CXR image classification.
2. Both local and global features are obtained and fed into two branches for feature extraction and finally fused for classification.

Data processing
The data processing stage includes data transformation, data augmentation and adaptive normalization.

Data transformation
Datasets collected in this study come from different medical institutions, in image or DICOM file formats. Thus, all data are converted to tensors of the same resolution to ensure uniformity.
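A minimal sketch of this conversion step, assuming pydicom, Pillow, and torchvision are available; the 224×224 target resolution and the file-extension check are illustrative choices, not the paper's exact settings.

```python
import numpy as np
import torch
import torchvision.transforms.functional as TF
import pydicom
from PIL import Image

TARGET_SIZE = [224, 224]  # illustrative target resolution

def load_as_array(path: str) -> np.ndarray:
    """Read a DICOM file or an ordinary image file as a float32 array."""
    if path.lower().endswith(".dcm"):
        return pydicom.dcmread(path).pixel_array.astype(np.float32)
    return np.asarray(Image.open(path).convert("L"), dtype=np.float32)

def to_tensor(path: str) -> torch.Tensor:
    """Convert a chest X-ray of either format to a 1 x H x W tensor of fixed size."""
    arr = load_as_array(path)
    t = torch.from_numpy(arr).unsqueeze(0)     # add a channel dimension
    return TF.resize(t, TARGET_SIZE)           # same resolution for all inputs
```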

Data augmentation
There are two problems regarding model universality in DL: the large amount of data required to train the model and the imbalance among data categories. The number of training samples varies greatly from category to category, which causes problems in the learning process of the classification task. To address these issues, image translation and rotation are used in this work to augment the dataset and balance the categories. Image translation: the chest X-ray image is randomly translated horizontally and vertically; (Δx, Δy) is the amount of random translation and is determined by the image resolution.
Image rotation: the chest X-ray image is rotated randomly clockwise or counterclockwise around its geometric center; θ is the angle of random rotation and is also determined by the image resolution.
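A minimal sketch of these two augmentations using torchvision; the translation fraction and rotation angle below are illustrative placeholders, since the paper derives (Δx, Δy) and θ from the image resolution.

```python
import torch
from torchvision import transforms

# Random shift of up to 5% of the image size in x and y, and a random rotation
# of up to ±10 degrees about the image centre (placeholder values).
augment = transforms.Compose([
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # (Δx, Δy) translation
    transforms.RandomRotation(degrees=10),                       # rotation by θ
])

image_tensor = torch.rand(1, 224, 224)   # stand-in for a pre-processed chest X-ray
augmented = augment(image_tensor)
```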

Normalization
Medical images differ significantly from natural images in terms of dynamic range. Natural images have three channels with values in [0, 255], whereas medical images such as CT scans can have a dynamic range of several thousand, and some, such as X-rays, may even be stored as floating-point data.

Adaptive normalization maps the original image I to the interval [a, b] according to Eq (1):

$$I_N = \frac{I - I_{Min}}{I_{Max} - I_{Min}}(b - a) + a \qquad (1)$$

In Eq (1), I is the original image, and $I_{Max}$ and $I_{Min}$ are the maximum and minimum values extracted from I. In this study, b and a are set to 1 and 0 respectively, and the normalized image $I_N$ is calculated by this equation.
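A minimal PyTorch implementation of Eq (1), with a and b defaulting to 0 and 1 as in this study:

```python
import torch

def adaptive_normalize(img: torch.Tensor, a: float = 0.0, b: float = 1.0) -> torch.Tensor:
    """Min-max normalize an image of arbitrary dynamic range to [a, b] (Eq 1)."""
    i_min, i_max = img.min(), img.max()
    return (img - i_min) / (i_max - i_min) * (b - a) + a
```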
CNN and transformer network
CNN module. In DL, a CNN is a class of artificial neural network (ANN) commonly applied to image processing. CNNs, considered shift-invariant and space-invariant, are based on the shared-weight architecture of convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. A CNN uses multiple convolutional kernels at different levels to collect local features of images for representation, giving it a unique advantage in extracting local image features. As network depth increases, a CNN can extract richer hierarchical features and enhance its representation, and the residual structure in ResNet can effectively alleviate the network degradation that accompanies increasing depth [18]. Fig 4(A) shows a basicblock, in which batch normalization follows both the down-sampling 3×3 spatial convolution and the subsequent 3×3 spatial convolution, and an identity shortcut connects the basicblock input to the convolution output. The convolution module in this work contains L (L>1) basicblocks.
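A minimal PyTorch sketch of such a basicblock (3×3 convolutions, batch normalization, and an identity shortcut), following the standard ResNet formulation; channel counts and strides are illustrative.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """3x3 conv -> BN -> ReLU -> 3x3 conv -> BN, plus an identity (or projection) shortcut."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut when the spatial size or channel count changes.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))
```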

Transformer module. The transformer, an architecture consisting of self-attention and MLP layers, uses the multi-head attention mechanism to capture spatial transformations and long-range feature dependencies and thereby extract global features of the image.
Self-attention is the core of the transformer: it maps a query q, a set of keys k, and values v, all derived from the input, to an output. The output can be seen as a weighted sum of the values, where the weights are derived from self-attention. Through the self-attention mechanism, the inputs $x_1$ and $x_2$ are transformed into $z_1$ and $z_2$ by the following equations:

$$q_i = W^Q x_i, \qquad k_i = W^K x_i, \qquad v_i = W^V x_i, \qquad i \in \{1, 2\}$$

$W^Q$, $W^K$, and $W^V$ are three weight matrices. In the above equations, $x_1$ and $x_2$ share the same weight matrices, and through this operation information is exchanged between the vectors $x_1$ and $x_2$. $z_1$ and $z_2$ are obtained as linear combinations of $v_1$ and $v_2$, with combination weights $\theta$ given by the scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

The encoder part of the transformer module, as used in ViT for image classification tasks, is shown in Fig 4(B). Each transformer encoder contains a multi-head self-attention module and an MLP module, with LayerNorm applied before each of them. The embedded patches are fed to the transformer encoder, with residual connections around the multi-head self-attention and MLP modules.
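A minimal sketch of one such encoder block in PyTorch, with LayerNorm before the multi-head self-attention and the MLP and residual connections around both; the embedding dimension, number of heads, and MLP ratio below are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Pre-norm encoder block: x + MSA(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, dim: int = 384, heads: int = 6, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                      # x: (B, N+1, dim) embedded patches
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```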

Proposed network
The features of medical images include obvious local lesion features and scattered global features. Thus, this study proposes a three-stage image classification network based on the CNN-transformer fusion, which consists of feature extraction, feature focus, and feature classification sub-networks. In this model, the local features of the image are extracted by the convolution module and the global features are extracted by the transformer, and the two are fused. This fusion captures lesion features at both local and global scales and yields better classification results.
Feature extraction sub-network. The structure of the feature extraction sub-network is shown in Fig 5. The image tensor is convolved by a 5×5 convolution kernel with a stride of 2 and a 3×3 convolution kernel with a stride of 2. We measured the effect of different convolution receptive fields in Section 4, and local feature extraction with a 5×5 convolution kernel performs better. Batch normalization (BN) and a rectified linear unit (ReLU) follow each convolution layer. However, the feature map extracted by the CNN does not match the feature dimension of the transformer. In detail, the feature vector extracted by the CNN is H × W × C (H, W, and C are height, width, and channel, respectively), while the encoded shape after the transformer is (N + 1) × D (N, 1, and D are the number of patches, the category label, and the output dimension, respectively). In this study, feature maps extracted by the convolution layer are converted into 7×7 patches by the customized convolution-transformer feature interaction module; the down-sampled patches are then projected through a linear layer to the same dimension as the embedded patches and added to them for feature fusion. In the other direction, the global features attended to by the multi-head self-attention are converted into a 56×56 tensor, which is projected with a 1×1 convolution layer and added to the tensor produced by the 3×3 convolution layer for feature fusion. This enables information interaction between the local and global features.
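A minimal sketch of the shape handling behind this two-way interaction, showing only how an H × W × C feature map can be converted to patch tokens and how tokens can be mapped back to a 56×56 tensor; the module names, channel counts, and embedding dimension are illustrative assumptions, not the paper's exact custom module.

```python
import torch
import torch.nn as nn

class CNNToPatches(nn.Module):
    """Down-sample a CNN feature map to 7x7 patches and project them to the transformer dimension D."""
    def __init__(self, channels: int = 64, dim: int = 384):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d((7, 7))          # B x C x 7 x 7
        self.proj = nn.Linear(channels, dim)

    def forward(self, fmap):                              # fmap: B x C x H x W
        x = self.pool(fmap).flatten(2).transpose(1, 2)    # B x 49 x C
        return self.proj(x)                               # B x 49 x D, added to the embedded patches

class PatchesToCNN(nn.Module):
    """Reshape patch tokens back to a 56x56 map and project them with a 1x1 convolution."""
    def __init__(self, dim: int = 384, channels: int = 64):
        super().__init__()
        self.up = nn.Upsample(size=(56, 56), mode="bilinear", align_corners=False)
        self.proj = nn.Conv2d(dim, channels, kernel_size=1)

    def forward(self, tokens):                            # tokens: B x (N+1) x D, class token first
        x = tokens[:, 1:, :]                              # drop the class token
        side = int(x.shape[1] ** 0.5)                     # assume a square grid of patches
        x = x.transpose(1, 2).reshape(x.shape[0], -1, side, side)
        return self.proj(self.up(x))                      # B x channels x 56 x 56, added to the CNN branch
```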
Feature focus sub-network. The feature focus sub-network is made up of two modules, the CNN and the transformer. The CNN branch consists of three groups of basicblock modules: the first contains two basicblocks with a stride of 1 and 64 convolution kernels, the second two basicblocks with a stride of 2 and 128 kernels, and the third two basicblocks with a stride of 2 and 256 kernels. The transformer branch consists of 8 transformer modules. As Fig 6 shows, the local features extracted by the convolutions become increasingly complex and abstract, while the global features are aggregated through the self-attention mechanism of the transformer.
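A sketch of how the two branches could be stacked, reusing the BasicBlock and TransformerEncoderBlock sketches given earlier; the exact interleaving and fusion points of the paper are not reproduced here.

```python
import torch.nn as nn

# CNN branch: three groups of two basicblocks each (64, 128, 256 kernels; strides 1, 2, 2).
cnn_branch = nn.Sequential(
    BasicBlock(64, 64, stride=1),   BasicBlock(64, 64),
    BasicBlock(64, 128, stride=2),  BasicBlock(128, 128),
    BasicBlock(128, 256, stride=2), BasicBlock(256, 256),
)

# Transformer branch: 8 encoder blocks operating on the embedded patch tokens.
transformer_branch = nn.Sequential(*[TransformerEncoderBlock(dim=384) for _ in range(8)])
```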
Feature classification sub-network. Fig 7 shows the structure of the feature classification sub-network, where the features extracted from the different modules of the feature focus sub-network are fused to obtain local and global features; the final feature vectors are generated by the average pooling and linear layers and output through the softmax layer to predict the categories of Covid-19, normal, and bacterial pneumonia.
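A minimal sketch of such a fusion-and-classification head for the three classes; the feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Fuse CNN and transformer features, average-pool, and predict 3 classes."""
    def __init__(self, cnn_channels: int = 256, token_dim: int = 384, num_classes: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(cnn_channels + token_dim, num_classes)

    def forward(self, cnn_feat, class_token):
        # cnn_feat: B x C x H x W local features; class_token: B x D global feature.
        local_vec = self.pool(cnn_feat).flatten(1)           # B x C
        fused = torch.cat([local_vec, class_token], dim=1)   # B x (C + D)
        return torch.softmax(self.fc(fused), dim=1)          # class probabilities
```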

Experiment and analysis
This section is concerned with the evaluation of the proposed model. To begin with, the dataset and parameter settings used in the experiments are specified. Next, the proposed model is compared with some DL-based models on this dataset, and then with other models for the detection of Covid-19.

Dataset
The data in this study come from three medical institutions: Guangzhou Women and Children Medical Centre dataset [44], MIDRC-RICORD [45][46][47], COVIDx CXR dataset [28]. The categories of collected data are Covid-19, bacterial pneumonia, and normal. Data pre-processing was performed on the collected data and the distribution of the data after pre-processing is shown in Table 1.

Experimental setup
Accuracy, precision, recall, and F1 score were used as the evaluation metrics. The experiments were carried out on a 64-bit Red Hat 4.8.5-28 operating system. Four-card parallel training was conducted on an Intel(R) Xeon(R) E5-2630 CPU and Tesla M60 GPUs, each card with 8 GB of memory. The model was built and trained under PyTorch 1.9.1, CUDA 10.2, and cuDNN 7.6, with the training parameters given in Table 2. Table 3 shows the evaluation metrics of the different models on the present dataset, and Table 4 compares the proposed model with some other models regarding the detection of Covid-19.
The results show that the proposed model is more suitable than the other models for classifying Covid-19 images. This may be because local and global features of the lesion are equally important for the diagnosis of Covid-19. In previous studies, researchers tended to focus more on local features and realized lesion classification by aggregating them. The network proposed in this study attends to both local lesion features and scattered global features in Covid-19 images. Fusing the two kinds of features avoids over-emphasizing local features at the expense of global ones when characterizing lesions, thus achieving better results.

Discussion
This section compares the effects of possible module combinations, different convolutional kernel sizes, and mutual versus one-way fusion.

Possible module combinations
It is well known that local features become progressively more abstract as the convolutional layers deepen, while global features are progressively condensed as the transformer layers extract them. Our proposed network extracts different kinds of features and fuses the features from different branches through mutual feature fusion, reducing the loss of useful feature information. To investigate the fusion of local and global features, we conducted experiments on the possible module combinations. As shown in Table 5, the fusion of Transformer_block1 and Conv_block1 gives the best results.

Different convolution kernel size
CNN acquires local features through convolutional kernels, which output different feature maps. The feature extraction sub-network extracts local features and the transformer extracts global features for fusion, achieving good results on the experimental dataset. To verify the performance of the proposed model, this study also explores the effect of fusing feature maps produced by convolutional kernels of different sizes with the global information. From Table 6, we can see that the feature extraction sub-network achieves better results when the convolutional kernel size is 5. Compared with other kernel-size settings, this sub-network extracts more accurate detailed features and better complements the local information missing from the global features, thus achieving better results.

Mutual fusion and one-way fusion
Our network demonstrates that the fusion of global and local features yields excellent results in Covid-19 classification. The comparisons include one-way fusion from CNN to transformer, one-way fusion from transformer to CNN, and mutual fusion. Table 7 compares the effects of the three fusion methods, and the results show that mutual fusion works best.

Conclusions
In this study, a CNN-transformer fusion network is proposed for Covid-19 image classification. This network makes the best of the different feature extraction capabilities of the CNN and the transformer: the CNN module and the transformer module extract the local and global features of medical images, respectively. The network therefore fuses local lesion features and scattered global features to achieve classification, attending to both the global and the local features. The experiments show that the proposed network performs better than other DL-based models for the classification of Covid-19, bacterial pneumonia, and normal cases. The comparison of the proposed model with other models concerning Covid-19 reveals that our model is good at detecting Covid-19 in CXR images and achieves superior results. There is still room for improvement in the following areas. Decentralized data from different institutions are required to improve the classification performance of the model. The model also remains to be refined to distinguish between more types of pneumonia, and further developed into a computer-aided diagnosis system for pneumonia.